Abstract

The topological structures of the Internet and the Web 
have received considerable attention. However, there has 
been little research on the topological properties of individ- 
ual web sites. In this paper, we consider whether web sites 
(as opposed to the entire Web) exhibit structural similari- 
ties. To do so, we exhaustively crawled 18 web sites as di- 
verse as governmental departments, commercial companies 
and university departments in different countries. These
web sites ranged in size from a few thousand pages to
millions of pages. Statistical analysis of these 18 sites revealed
that the internal link structures of the web sites are
significantly different when measured with first and second-order
topological properties, i.e. properties based on the
connectivity of an individual node or a pair of nodes. However,
examination of a third-order topological property, which considers
the connectivity between three nodes that form a triangle,
revealed a strong correspondence across web sites,
suggestive of an invariant. Comparison with the Web, the
AS Internet, and a citation network showed that this third-order
property is not shared across other types of networks.
Nor is the property exhibited in generative network models
such as that of Barabási and Albert.

Index Terms - Hypertext systems, Topology, Modeling. 

1 Introduction 



The Web has become a global tool for sharing information.
It can be represented as a huge graph which consists
of billions of hypertext web pages connected by hyperlinks
pointing from one web page to another [4] [11]. Each web
page is part of a larger web site, which is loosely defined as
a group of web pages whose URL addresses use the same
domain name, such as cs.ucl.ac.uk and ieee.org.



Studying and understanding the Web's topological structure
is important as it may lead to improved techniques for
information retrieval. The link structure of the Web has been
used in algorithms such as PageRank [16] and HITS [9] to
estimate the importance of web pages, and in [8] [3] [10] for
community discovery and clustering. These algorithms do
not typically use the internal link structure within a web site,
but rather rely on external links between web sites. Nevertheless,
the internal structure of a web site is important. For
example, the statistical properties of a web site's link structure
may be used as an informative measure of web site quality,
e.g. navigability [20].

There is surprisingly little study of the structural proper- 
ties of web sites in general. Certainly, it is well known that 
examination of the graph structure of an individual web site 
can be used to calculate the mean diameter of the web site, 
and other metrics, that can then be used to infer properties 
regarding the navigability of the web site. However, we are 
unaware of prior work that provides a statistical topolog- 
ical characterization of all web sites. As such, web sites, 
as opposed to the Web, are often considered to exhibit an 
arbitrary statistical topological structure. 

However, this study reveals that the topology of web sites 
is not arbitrary. In fact, examination of the triangle coeffi- 
cient (the number of triangles of a node) as a function of de- 
gree (the number of links of the node) reveals a very strong 
correlation across web sites, suggestive of a possible invari- 
ant of web site link structure. Moreover, this third-order 
property varies across other networks, such as the Web, the 
Internet and citation networks. Thus, it appears to strongly 
characterise web sites. 

This paper is organised as follows. In Section 2 we introduce
a number of topological metrics which have been
used to characterise and compare network structures. In
Section 3 we introduce the datasets used in this study. These
consist of 18 web sites which vary in size from a few thousand
pages to millions of pages. The web sites cover a broad
range of entities: 9 government sites from various countries,
3 commercial sites, 3 educational sites and 3 very large
sites (IEEE, Wikipedia, Yahoo!). In Section 4 we present
our statistical results and discuss the implications. In addition
to comparing data across web sites, we also compare
with (i) subsets of the Web, (ii) a citation network, (iii) the
AS-level Internet network, and (iv) the generative model of
Barabási and Albert [2]. Section 5 summarises the key results
and discusses how this work can be used to improve
generative models of hypertext networks.

2 Definition of Topological Properties 

We briefly review and define the following topological
properties, which are grouped into three orders according
to the scope of information required to compute them [12].
These are (i) the 1st-order properties, e.g. degree distribution,
(ii) the 2nd-order properties, e.g. degree correlation
and rich-club connectivity, and (iii) the 3rd-order properties,
e.g. triangle coefficient and clustering coefficient.

2.1 The 1st-Order Properties




Figure 1. Example of (a) an undirected graph 
and (b) a directed graph. 

The link structure of a web site can be described as an
undirected graph in which a node represents a web page
and a link denotes the existence of at least one hyperlink
connection between two nodes. The connectivity, or degree
$k$, of a node is defined as the number of links, or neighbours,
the node has. For example, in Figure 1a, node A has four
neighbours B, C, D and E, and its degree is $k_A = 4$. A
web site can also be described as a directed graph in which
each link has a direction pointing from one node to another.
The in-degree $k_{in}$ of a node is then defined as the number
of incoming links and the out-degree $k_{out}$ as the number of
outgoing links. For example, in Figure 1b, node A has three
incoming links from nodes B, C and E, i.e. $k_{in} = 3$, and
two outgoing links to nodes C and D, i.e. $k_{out} = 2$. This
paper studies the link structure of web sites as undirected graphs
unless otherwise stated.
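
As a minimal sketch of these definitions, the Python below counts in-degree, out-degree and undirected degree from a list of links. The edge list is an assumption: it contains only the links that the text attributes to node A in Figure 1b (the figure itself is not reproduced here), and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical directed links (source, target), chosen to match the text's
# description of node A in Figure 1b; the real figure may contain other links.
DIRECTED_LINKS = [
    ("B", "A"), ("C", "A"), ("E", "A"),  # three incoming links of A
    ("A", "C"), ("A", "D"),              # two outgoing links of A
]

def in_out_degrees(links):
    """Return (k_in, k_out) dictionaries for a list of directed links."""
    k_in, k_out = defaultdict(int), defaultdict(int)
    for source, target in links:
        k_out[source] += 1
        k_in[target] += 1
    return dict(k_in), dict(k_out)

def undirected_degrees(links):
    """Collapse directed links into undirected ones and count neighbours."""
    neighbours = defaultdict(set)
    for u, v in links:
        if u != v:                       # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}

k_in, k_out = in_out_degrees(DIRECTED_LINKS)
print(k_in["A"], k_out["A"])                    # in- and out-degree of node A
print(undirected_degrees(DIRECTED_LINKS)["A"])  # undirected degree of node A
```

Note that the reciprocal pair A to C and C to A collapses into a single undirected link, which is why the undirected degree of A is 4 rather than 5.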

The degree of a node measures a node's local connectivity.
Topological properties calculated by using the degree
of individual nodes are classified as 1st-order properties,
e.g. the average degree $\bar{k}$ of nodes in a network.

2.1.1 Degree Distribution 

The most studied topological property for large networks is
the degree distribution $P(k)$, which is defined as the probability
that a randomly selected node has degree $k$. A random
graph [7] is characterised by a Poisson degree distribution,
where the distribution peaks at the network's average degree.
It has been reported that a number of networks
follow a power-law degree distribution,

$$P(k) \sim k^{-\gamma}, \qquad 2 < \gamma < 3. \tag{1}$$

This means that most nodes have very few links, while a
few nodes have a very large number of links.
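
The definition of $P(k)$ can be sketched in a few lines of Python. The degree sequence below is illustrative only and is not taken from any dataset in this paper; the function name is an assumption.

```python
from collections import Counter

def degree_distribution(degrees):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

# Illustrative degree sequence: many low-degree nodes, one high-degree node.
example_degrees = [1, 1, 1, 1, 2, 2, 3, 5]
P = degree_distribution(example_degrees)
for k, p in P.items():
    print(k, p)
```

Plotting such a dictionary on log-log axes is the usual way to check by eye whether a distribution is close to a power law.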

2.2 The 2nd-Order Properties

Topological properties are classified as 2nd-order properties
if they are based on the degree information of the
two end nodes of a link, such as the joint degree distribution
$P(k, k')$, which is the probability that a randomly
selected link connects a node of degree $k$ with a
node of degree $k'$. The 2nd-order properties provide a
more complete description of a network's structure than
the 1st-order properties. For example, the degree distribution
can be obtained from the joint degree distribution:
$P(k) = (\bar{k}/k) \sum_{k'} P(k, k')$, where $\bar{k}$ is the average degree.

2.2.1 Degree Correlation 

The nearest-neighbours average degree, $k_{nn}$, of $k$-degree
nodes [11] [22] is a projection of the joint degree distribution,
given by

$$k_{nn}(k) = \frac{\bar{k} \sum_{k'} k' P(k, k')}{k P(k)}. \tag{2}$$

A network is called an assortative network if $k_{nn}(k)$ increases
with $k$, which means nodes tend to attach to similar
nodes, i.e. high-degree nodes to high-degree nodes and
low-degree nodes to low-degree nodes ('assortative mixing').
Many social networks are assortative networks. A
network is a disassortative network if $k_{nn}(k)$ decreases with
$k$, i.e. high-degree nodes tend to connect with low-degree
nodes and vice versa ('disassortative mixing'). This is the
case for most information and communications networks.
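
A hedged sketch of how $k_{nn}(k)$ might be computed from an undirected edge list follows. It averages, over all nodes of degree $k$, the mean degree of their neighbours, which is equivalent to the projection above when the edge list has no duplicates or self-loops. The star graph and the function name are illustrative assumptions.

```python
from collections import defaultdict

def knn_by_degree(edges):
    """Return {k: k_nn(k)} for an undirected edge list without duplicates."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    degree = {n: len(nbrs) for n, nbrs in neighbours.items()}

    sums, counts = defaultdict(float), defaultdict(int)
    for node, nbrs in neighbours.items():
        k = degree[node]
        # mean degree of this node's neighbours
        sums[k] += sum(degree[m] for m in nbrs) / k
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# A star graph: hub "h" linked to four leaves (illustrative data only).
star = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d")]
print(knn_by_degree(star))
```

In the star graph the degree-1 leaves all attach to the degree-4 hub and the hub attaches only to leaves, so $k_{nn}(k)$ falls as $k$ grows, which is the disassortative pattern described above.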

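A network's degree correlation, or mixing pattern, can
be summarised by a single scalar called the assortative coefficient
[14] [15],

$$\alpha = \frac{L^{-1}\sum_{i} s_i d_i - \left[L^{-1}\sum_{i}\tfrac{1}{2}(s_i + d_i)\right]^2}{L^{-1}\sum_{i}\tfrac{1}{2}(s_i^2 + d_i^2) - \left[L^{-1}\sum_{i}\tfrac{1}{2}(s_i + d_i)\right]^2}, \tag{3}$$

where $L$ is the number of links and $s_i$, $d_i$ are the degrees of the
end nodes of the $i$th link, with $i = 1, 2, \ldots, L$. The value of
$\alpha$ is in the range $[-1, 1]$. For assortative networks $\alpha > 0$
and for disassortative networks $\alpha < 0$.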
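
The coefficient can be sketched in Python as below, assuming it takes the standard form of Newman's degree-correlation coefficient (link-averaged product of end-node degrees minus the squared mean, divided by the variance term). The function name and the two small test graphs are illustrative assumptions, and the edge list is assumed to contain no self-loops or duplicated links.

```python
def assortative_coefficient(edges):
    """Assortative coefficient of an undirected edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    L = len(edges)
    # Link-averaged product, mean and mean square of the end-node degrees.
    prod = sum(degree[u] * degree[v] for u, v in edges) / L
    mean = sum(0.5 * (degree[u] + degree[v]) for u, v in edges) / L
    sq = sum(0.5 * (degree[u] ** 2 + degree[v] ** 2) for u, v in edges) / L

    denominator = sq - mean ** 2
    if denominator == 0:
        raise ValueError("undefined: all end nodes have the same degree")
    return (prod - mean ** 2) / denominator

star = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(assortative_coefficient(star), assortative_coefficient(path))
```

A star graph is the extreme disassortative case, so the coefficient reaches the lower bound of its range; the four-node path is disassortative to a lesser degree.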

2.2.2 Rich-Club Connectivity 

The rich-club connectivity [26] [5] measures how tightly
the high-degree nodes, or rich nodes, interconnect among
themselves. If $N_{>k}$ is the number of nodes with degrees larger
than $k$ and they share $E_{>k}$ links among themselves, the
rich-club connectivity is defined as

$$\phi(k) = \frac{2E_{>k}}{N_{>k}(N_{>k} - 1)}, \tag{4}$$

where $N_{>k}(N_{>k} - 1)/2$ is the maximum possible number
of links that the $N_{>k}$ nodes can have. For example, in
Figure 1a there are five nodes (A, B, C, D and E) with
degrees larger than 2 and they have 8 links between them,
thus $\phi(2) = 8 / (5 \times (5 - 1)/2) = 0.8$, which means the 5
best-connected nodes share 80% of the links they would have
if fully interconnected. The rich-club connectivity is a
2nd-order property because whether
a link belongs to $E_{>k}$ depends on the degrees of the link's
two end nodes.
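
The worked example can be checked with a short Python sketch. Figure 1a is not reproduced here, so the edge list is a hypothetical reconstruction chosen to be consistent with what the text says about the figure (the degrees of A, B and C, the five triangles of C, and the 8 links among A to E); node G and the function name are assumptions.

```python
# Hypothetical undirected links consistent with the text's description of
# Figure 1a; node G and the exact layout are assumptions.
FIGURE_1A_LINKS = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("C", "H"), ("F", "H"), ("B", "G"),
]

def rich_club_connectivity(edges, k):
    """phi(k) = 2 E_{>k} / (N_{>k} (N_{>k} - 1)) for nodes with degree > k."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    rich = {node for node, d in degree.items() if d > k}
    n_rich = len(rich)
    if n_rich < 2:
        return None  # undefined for fewer than two rich nodes

    # Links whose two end nodes both belong to the rich club.
    e_rich = sum(1 for u, v in edges if u in rich and v in rich)
    return 2 * e_rich / (n_rich * (n_rich - 1))

print(rich_club_connectivity(FIGURE_1A_LINKS, 2))
```

With this edge list the five nodes of degree larger than 2 share 8 of the 10 possible links, reproducing the value of 0.8 given in the text.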




The rich-club connectivity is a different projection of the
joint degree distribution,

$$\phi(k) = \frac{N \bar{k} \sum_{k'=k+1}^{k_{max}} \sum_{k''=k+1}^{k_{max}} P(k', k'')}{\left[N \sum_{k'=k+1}^{k_{max}} P(k')\right] \cdot \left[N \sum_{k'=k+1}^{k_{max}} P(k') - 1\right]}, \tag{5}$$

where $N$ is the total number of nodes and $k_{max}$ is the maximum
degree in a network. The rich-club connectivity does
not trivially relate to the degree correlation [24]. For example,
the two graphs shown in Figure 2 are both disassortative
networks, but for the 3 best-connected nodes in Figure 2a,
$\phi = 1$, and in Figure 2b, $\phi = 0$.

Figure 2. Graphs (a) with a rich-club and (b)
without a rich-club.

2.3 The 3rd-Order Properties

The 3rd-order properties are based on connectivity information
between three nodes that form a triangle.

2.3.1 Triangle Coefficient

The triangle coefficient $\triangle$ is defined as the number of triangles
a node shares, which is equivalent to the number of
links among the node's neighbours [25]. The triangle is the basic
unit of network redundancy. The more triangles, the
more alternative paths between nodes.

In-triangle and out-triangle coefficients. On a directed
graph, a node's neighbours can be divided into two groups:
in-neighbours, which are connected by incoming links;
and out-neighbours, which are connected by outgoing
links. An in-triangle of a node consists of the node and two
of its in-neighbours, and an out-triangle consists of the node
and two of its out-neighbours. For example, in Figure 1b, node A
has two in-triangles, ABC and ACE, and one out-triangle,
ACD. Therefore node A's in-triangle coefficient $\triangle_{in}$ is 2
and its out-triangle coefficient $\triangle_{out}$ is 1.

2.3.2 Clustering Coefficient

A more widely studied 3rd-order property is the clustering
coefficient $C$, which is defined as the ratio of actual links
among a node's neighbours to the maximal possible number
of links they can share [23]. The clustering coefficient of a
node can be given as a function of the node's degree and its
triangle coefficient,

$$C = \frac{\triangle}{k(k-1)/2}. \tag{6}$$

Two nodes with different triangle coefficients can have the
same clustering coefficient. For example, in Figure 1a, node
B has three neighbours and one triangle, and node C has six
neighbours and five triangles (CBA, CAD, CAE, CED
and CFH). However, their clustering coefficients are the
same,

$$C_B = \frac{1}{3(3-1)/2} = \frac{5}{6(6-1)/2} = C_C.$$

Therefore one should use the triangle coefficient to infer the
clustering information of nodes with different degrees.

3 Datasets

Here we briefly summarise the various datasets used in
this study.

3.1 Web sites



We exhaustively crawled the 18 web sites of the organisations
listed in Table 1: 1) the national audit office
or equivalent of Canada (AO-CA), Italy (AO-IT), the
United Kingdom (AO-UK) and the United States (AO-US);
2) the foreign office or equivalent of Australia (FO-AU),
the Czech Republic (FO-CZ), Germany (FO-DE), Japan
(FO-JP) and the UK (FO-UK); 3) commercial web sites,
namely HSBC bank in the UK (COM-HSBC), the UK retailer
NEXT (COM-NEXT) and the automobile company
SKODA (COM-SKODA); 4) educational web sites, namely
the Faculty of Arts at the University of Auckland, New
Zealand (EDU-AUCK), the Haas School of Business at
the University of California at Berkeley (EDU-UCB), and
the Department of Computer Science at University College
London (EDU-UCL); and 5) three very large web sites with
millions of web pages, namely the IEEE (LARGE-IEEE),
the Simplified Chinese edition of Wikipedia (LARGE-WIKI)
and Yahoo! (LARGE-YAHOO).

Table 1. Properties of the Datasets

| Dataset | Web site domain name | Number of nodes | Number of links | Average degree | Assortative coefficient | Average triangle coef. |
|---|---|---|---|---|---|---|
| AO-CA | cac.gc.ca | 12,730 | 120,485 | 15.94 | -0.35 | 159.78 |
| AO-IT | corteconti.it | 32,614 | 200,516 | 11.96 | -0.40 | 186.11 |
| AO-UK | nao.gov.uk | 4,027 | 25,453 | 11.84 | -0.36 | 89.40 |
| AO-US | gao.gov | 19,625 | 223,998 | 21.69 | -0.63 | 289.37 |
| FO-AU | dfat.gov.au | 29,140 | 791,039 | 53.25 | -0.78 | 1,066.30 |
| FO-CZ | mzv.cz | 31,246 | 778,163 | 45.23 | -0.13 | 1,134.06 |
| FO-DE | auswaertiges-amt.de | 46,219 | 2,234,535 | 94.10 | -0.56 | 4,439.89 |
| FO-JP | mofa.go.jp | [illegible] | 493,861 | [illegible] | [illegible] | [illegible] |
| FO-UK | fco.gov.uk | 33,280 | 694,255 | 36.29 | -0.16 | 884.54 |
| COM-HSBC | hsbc.co.uk | 51,043 | 68,454 | 2.62 | -0.05 | 7.97 |
| COM-NEXT | next.co.uk | [illegible] | [illegible] | 14.11 | [illegible] | 182.55 |
| COM-SKODA | skoda-auto.com | 49,341 | 727,119 | 28.39 | -0.30 | 292.12 |
| EDU-AUCK | arts.auckland.ac.nz | 12,457 | 129,870 | 17.64 | -0.21 | 258.13 |
| EDU-UCB | haas.berkeley.edu | 100,025 | 373,521 | 6.90 | -0.09 | 84.85 |
| EDU-UCL | cs.ucl.ac.uk | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| LARGE-IEEE | ieee.org | 1,977,923 | 5,614,610 | 5.54 | -0.05 | 57.92 |
| LARGE-WIKI | zh.wikipedia.org | 1,913,510 | 8,249,248 | 8.12 | -0.13 | 64.54 |
| LARGE-YAHOO | yahoo.com | 3,448,289 | 12,039,165 | 6.72 | -0.08 | 81.69 |
| Web | | 43,425 | 173,696 | 7.96 | -0.12 | 38.43 |
| Citation network | | 244,864 | 897,170 | 7.33 | -0.08 | 4.20 |
| AS Internet | | 9,200 | 28,957 | 6.30 | -0.24 | 21.37 |
| BA model | | 10,000 | 30,000 | 6.00 | -0.02 | 0.16 |

We used the Nutch 1.6.0 crawler
(http://lucene.apache.org/nutch). Each
crawl was started from a web site's homepage and was
restricted to the web site's domain as listed in Table 1.
The crawler was configured to allow for complete site
acquisition and collected all web pages up to a depth of
18. The default parameters were a 5-second delay between
requests to the same host, and 10,000 attempts to retrieve
pages that fail with a 'soft' error [20]. We discarded
hyperlinks pointing to web pages outside the web site's
domain and removed self-loops and duplicated hyperlinks.
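
The clean-up step described above could be sketched as follows. This is an illustration rather than the procedure actually used: the function name, the example URLs and the host-matching rule (a host belongs to the site if it equals the domain or ends with a dot followed by the domain) are assumptions, and URL normalisation such as stripping fragments is deliberately left out because the text does not describe it.

```python
from urllib.parse import urlsplit

def clean_site_links(raw_links, domain):
    """Keep internal links only; drop self-loops and duplicated hyperlinks."""
    def is_internal(url):
        host = (urlsplit(url).hostname or "").lower()
        return host == domain or host.endswith("." + domain)

    seen, cleaned = set(), []
    for source, target in raw_links:
        if not (is_internal(source) and is_internal(target)):
            continue            # hyperlink leaves the web site's domain
        if source == target:
            continue            # self-loop
        if (source, target) in seen:
            continue            # duplicated hyperlink
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned

# Hypothetical crawl output for a site with domain example.org.
raw = [
    ("http://www.example.org/a", "http://www.example.org/b"),
    ("http://www.example.org/a", "http://www.example.org/b"),   # duplicate
    ("http://www.example.org/a", "http://www.example.org/a"),   # self-loop
    ("http://www.example.org/a", "http://other.example.com/x"), # external
]
print(clean_site_links(raw, "example.org"))
```

Only the first link survives the three filters in this example.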



We are aware of a number of available data sources
of the Web. We did not extract web site data from them
because they aim to sample the entire Web and contain very
incomplete information on the internal link structure of
individual web sites. For example, the Stanford WebBase data
(http://dbpubs.stanford.edu:8091/~testbed/doc2/WebB)
contains only 400 web pages with NASA's domain name
(nasa.org).



3.2 Web 



WT10g is a large dataset of the Web proposed by the
annual international Text REtrieval Conference (TREC,
http://trec.nist.gov). WT10g is constructed
from more than 320 gigabytes of archived data containing
1.7M web pages and the hyperlinks between them. It is
reported that WT10g retains properties of the larger Web [21]
and it has been used as a data resource for research on Web
retrieval and modelling. We randomly sampled 10 subsets
of WT10g, each of which contains 50,000 web pages and the
links between those pages. In this paper we use the average
properties of the 10 WT10g subsets as an approximation of
the Web's link structure.
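
One way to read this sampling procedure is sketched below: draw a random set of pages, keep only the links whose two ends were both drawn, compute a property on each subset, and average. Uniform sampling of pages, the function names and the toy graph are all assumptions; the text does not specify how the subsets were drawn.

```python
import random

def sample_induced_subgraph(edges, nodes, sample_size, rng):
    """Pick sample_size nodes at random and keep only links between them."""
    chosen = set(rng.sample(sorted(nodes), sample_size))
    kept = [(u, v) for u, v in edges if u in chosen and v in chosen]
    return chosen, kept

def average_over_samples(edges, nodes, sample_size, n_samples, metric, seed=0):
    """Average metric(nodes, edges) over n_samples random subsets."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        sub_nodes, sub_edges = sample_induced_subgraph(
            edges, nodes, sample_size, rng)
        values.append(metric(sub_nodes, sub_edges))
    return sum(values) / len(values)

# Toy example: a complete graph on 6 nodes, sampled 10 times at size 4.
toy_nodes = set(range(6))
toy_edges = [(i, j) for i in range(6) for j in range(i + 1, 6)]
mean_links = average_over_samples(
    toy_edges, toy_nodes, 4, 10, lambda n, e: len(e))
print(mean_links)
```

In the setting of this section the calls would use `sample_size=50000` and `n_samples=10`, with `metric` replaced by whichever topological property is being averaged.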



3.3 Citation Network 

The citation network [19] data was extracted from
the online computer science publication database CiteSeer
(http://citeseer.ist.psu.edu/). The CiteSeer
data contain 575K entries, from which we extracted
244,864 records having at least one reference (outgoing
link) or citation (incoming link).

3.4 AS Internet 

The Internet topology at the autonomous systems (AS)
level has been extensively studied in recent years [18] [25]
[13] [12]. On the AS Internet, nodes represent Internet service
providers and links represent connections between them. In
this paper we use the AS Internet dataset ITDK0304
collected by CAIDA.

3.5 Barabási-Albert Model

The Barabási and Albert (BA) model [2] has been widely
used in the study of complex networks. This model shows
that a power-law degree distribution can be produced by two
mechanisms: growth, where the network "grows" from a
small random graph by attaching new nodes to old nodes
in the existing system; and preferential attachment, where a
new node is attached preferentially to nodes that are already
well connected.
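
The two mechanisms can be illustrated with a compact Python sketch. It is not the implementation used in this study: the seed graph (a small complete graph rather than a random one), the parameter `m` (links added per new node) and the function name are assumptions. Choosing `m = 3` gives an average degree close to 6, matching the BA row of Table 1.

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow an n-node graph; each new node links to m distinct old nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Seed graph: a complete graph on m + 1 nodes (an assumption of this sketch).
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `ends` once per link end, so a uniform draw
    # from `ends` is a draw proportional to degree.
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n):               # growth
        targets = set()
        while len(targets) < m:               # preferential attachment
            targets.add(rng.choice(ends))
        for old in targets:
            edges.append((new, old))
            ends.extend((new, old))
    return edges

ba_edges = barabasi_albert(n=1000, m=3)
print(len(ba_edges), 2 * len(ba_edges) / 1000)  # links, average degree
```

Keeping one list entry per link end is a common trick that avoids recomputing degree-proportional probabilities at every step.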

4 Results 

Here we summarise our experimental findings. We examine
a variety of first, second and third-order topological
properties and compare them across the various web sites.
We then compare the topological properties of web sites
with those of other networks, specifically the Web, the AS Internet, a
citation network, and the generative model of Barabási
and Albert.

4.1 Comparison between the web sites 

4.1.1 The 1st and 2nd-Order Properties

As shown in Table 1, the size and the average degree of the
web sites vary significantly. The foreign office web sites
have very large average degrees, whereas the three large
web sites with millions of web pages have very small average
degrees. Figures 3a, b and c illustrate the degree distribution
$P(k)$, the degree correlation $k_{nn}(k)$, and the rich-club
connectivity $\phi(k)$ of the 18 web sites on a log-log scale.
Also shown are their average properties, depicted by circles
(see footnote 1). It is clear that the 1st and 2nd-order properties of the
web sites exhibit huge variations over several orders of magnitude.
Thus, the web sites cannot be well characterised by
the average of these properties. For example, in Figure 3c,
in some web sites the nodes of degree $k > 100$ are almost
fully interconnected among themselves, i.e. $\phi \approx 1$, whereas
in other web sites the interconnectedness is much looser,
with $\phi$ less than 0.001.

Footnote 1: The average degree distribution $P(k)$ is obtained as follows: for a
given $k$, if at least $X > 12$ of the 18 web sites have $P(k) > 0$, then
$P(k) = X^{-1} \sum_{i} P_i(k)$, where $i = 1, 2, \ldots, X$. Other
average properties are calculated in similar ways.

4.1.2 The 3rd-Order Properties

Figure 3d shows the complementary cumulative distribution
of the triangle coefficient, $P_c(\triangle)$, which is the probability
that a node's triangle coefficient is larger than $\triangle$. Figure 3e
shows the relationship between triangle coefficient and degree,
$\triangle(k)$, i.e. the average triangle coefficient of $k$-degree
nodes. Although the web sites do not show agreement
on $P_c(\triangle)$, they do exhibit a clear correspondence on $\triangle(k)$.
Some web sites have sharp spikes on their $\triangle(k)$ curves.
These spikes reflect the existence of star-like subgraphs in
these web sites, e.g. a web page with a long list of hyperlinks
pointing to documents or images. Compared to the
large number of web pages contained in a web site, the limited
number of such spikes is not statistically significant.

The average over all the web sites of the triangle coefficient
as a function of degree is also depicted in Figure 3e
(see circles). It is a smooth curve which represents all
the web sites well. This is suggestive of a structural invariant of
web sites.

Figure 3f shows that the web sites exhibit a similar correspondence
in the relationship between clustering coefficient and
degree, $C(k)$. Note that the average clustering coefficient,
depicted by circles, is not a monotonic function of degree.
This is because the clustering coefficient is itself a function
of the degree and triangle coefficient. In the following
we do not consider $C(k)$ further, as the triangle coefficient,
$\triangle(k)$, contains all information provided by $C(k)$.

Figure 3. Topological properties of the web sites: a) degree distribution, $P(k)$; b) nearest-neighbours
average degree of $k$-degree nodes, $k_{nn}(k)$; c) rich-club connectivity as a function of degree, $\phi(k)$;
d) complementary cumulative distribution of triangle coefficient, $P_c(\triangle)$; e) correlation between triangle
coefficient and degree, $\triangle(k)$; and f) correlation between clustering coefficient and degree, $C(k)$.

4.2 Comparison with other networks

Here we compare the topological properties of the average
over all web sites with those of other networks, specifically
the Web, a citation network, the AS Internet, and the
BA model.

4.2.1 Degree Distribution

Figure 4a shows that the degree distributions of the Web,
the citation network, the AS Internet and the BA model
can be well described as a power law, $P(k) \sim k^{-\gamma}$ with
$2 < \gamma < 3$. However, the average degree distribution of
the web sites is very different: for $k < 10$ or $k > 30$, it
can be described as a power law; but for $10 < k < 30$, the
distribution increases exponentially with degree.
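
The averaging rule used for the web sites' average degree distribution (given in the footnote to Section 4.1.1) can be sketched as follows. The function name is an assumption, and the threshold is left as a parameter because the footnote's exact comparison is hard to read in the source; for the 18 web sites it would be set to the footnote's threshold of about 12 sites.

```python
def average_distribution(site_distributions, min_sites):
    """Average P_i(k) over the sites where it is positive.

    site_distributions: list of {k: P_i(k)} dictionaries, one per web site.
    min_sites: minimum number of sites that must have P_i(k) > 0 for the
               degree k to be included in the average.
    """
    averaged = {}
    all_degrees = set().union(*site_distributions)
    for k in sorted(all_degrees):
        positive = [p[k] for p in site_distributions if p.get(k, 0) > 0]
        if len(positive) >= min_sites:
            # X = len(positive) sites contribute to the average at this k.
            averaged[k] = sum(positive) / len(positive)
    return averaged

# Three hypothetical sites, with a threshold of two sites.
sites = [
    {1: 0.5, 2: 0.5},
    {1: 0.25, 2: 0.25, 3: 0.5},
    {1: 0.75, 3: 0.25},
]
print(average_distribution(sites, min_sites=2))
```

Degrees observed in too few sites are simply omitted, so the averaged curve is defined only where enough sites contribute.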

4.2.2 Degree Correlation 

Figure 4b shows that the citation network and the AS Internet
are typical disassortative networks where $k_{nn}$ decreases
monotonically with $k$. The BA model is an example of a
neutral network where $k_{nn}$ does not change with $k$. For the
average of the web sites, and the Web, $k_{nn}$ first increases
and then decreases with $k$, and peaks at $k = 30$ and $k = 15$
respectively. For large degrees, the average $k_{nn}$ of the web
sites is significantly larger than that of all other networks.

4.2.3 Rich-Club Connectivity 

Figure 4c shows that the AS Internet has the highest rich-club
connectivity, with a fully interconnected core, i.e.
$\phi(k) = 1$, for $k > 200$. The citation network has the lowest
rich-club connectivity. Although the BA model is very
different from the web sites when measured by $k_{nn}(k)$, the
two exhibit similar rich-club connectivity for $k > 10$.



4.2.4 Distribution of Triangle Coefficient 

Figure 4d shows that the web sites contain significantly
more triangles than all other networks. The high density
of triangles ensures the navigability of the web sites.

4.2.5 Triangle Coefficient as a Function of Degree 

Figure 4e shows that, in general, all the networks exhibit
a positive correlation between triangle coefficient and degree.
This is because the larger the degree of a node, the
more neighbours the node has, and thus the higher the chance
of forming triangles. As discussed in Section 4.1.2, all the
web sites exhibit a very similar relationship between triangle
coefficient and degree, which is well characterised by the
average over all the web sites. The average correlation between
triangle coefficient and degree of the web sites can be
closely fitted by a function given as

$$f(x) = 0.064\, x^{2.94 - 0.361 \log_{10}(x)}$$

or

$$\log_{10}(f(x)) = -0.3579 \log_{10}^2(x) + 2.9432 \log_{10}(x) - 1.1907.$$
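
For readers who want to evaluate the fit, the two forms can be written as small Python functions. This assumes the printed fit reads $f(x) = 0.064\,x^{2.94 - 0.361\log_{10}(x)}$ in its power form and $\log_{10} f(x) = -0.3579\log_{10}^2(x) + 2.9432\log_{10}(x) - 1.1907$ in its log form; the function names are illustrative.

```python
import math

def fitted_triangle_coefficient(k):
    """Power form of the fit: 0.064 * k ** (2.94 - 0.361 * log10(k))."""
    return 0.064 * k ** (2.94 - 0.361 * math.log10(k))

def fitted_triangle_coefficient_log(k):
    """Log form of the fit: a quadratic in log10(k)."""
    x = math.log10(k)
    return 10 ** (-0.3579 * x * x + 2.9432 * x - 1.1907)

for k in (10, 100, 1000):
    print(k, fitted_triangle_coefficient(k), fitted_triangle_coefficient_log(k))
```

The two forms use coefficients rounded to different precision, so they agree only approximately, and the gap widens slowly as the degree grows.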




Figure 4. Comparison between the average of the web sites and (i) the Web, (ii) a citation network, (iii)
the AS Internet, and (iv) the BA model: a) degree distribution, $P(k)$; b) nearest-neighbours average
degree of $k$-degree nodes, $k_{nn}(k)$; c) rich-club connectivity as a function of degree, $\phi(k)$; d) complementary
cumulative distribution of triangle coefficient, $P_c(\triangle)$; e) triangle coefficient as a function
of degree, $\triangle(k)$; and f) three triangle properties: triangle coefficient versus degree, $\triangle(k)$; in-triangle
coefficient versus in-degree, $\triangle_{in}(k_{in})$; and out-triangle coefficient versus out-degree, $\triangle_{out}(k_{out})$.



It is clear that the relationship between triangle coefficient
and degree for the web sites is different from that of the other
networks. The BA model exhibits the lowest number of triangles
as a function of node degree, followed by the citation network, and
then the AS Internet. For degree $k < 30$, the Web data
closely follows the average over web sites, but diverges
thereafter.
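
The curves compared in this section are averages of the per-node triangle coefficient over nodes of equal degree. A minimal Python sketch of that computation, together with the clustering coefficient derived from it, is given below. The edge list is a hypothetical reconstruction consistent with the text's description of Figure 1a (node G is an assumption), and the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical undirected links consistent with the description of Figure 1a.
FIGURE_1A_LINKS = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("C", "H"), ("F", "H"), ("B", "G"),
]

def triangle_coefficients(edges):
    """Return {node: (degree, triangle coefficient)} for an undirected graph."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    result = {}
    for node, nbrs in neighbours.items():
        # A triangle is a pair of neighbours that are themselves linked.
        triangles = sum(1 for a, b in combinations(nbrs, 2)
                        if b in neighbours[a])
        result[node] = (len(nbrs), triangles)
    return result

def triangles_by_degree(edges):
    """Return {k: average triangle coefficient of k-degree nodes}."""
    groups = defaultdict(list)
    for k, t in triangle_coefficients(edges).values():
        groups[k].append(t)
    return {k: sum(ts) / len(ts) for k, ts in sorted(groups.items())}

def clustering_coefficient(k, triangles):
    """C = triangles / (k (k - 1) / 2); returns None when k < 2."""
    return None if k < 2 else triangles / (k * (k - 1) / 2)

tc = triangle_coefficients(FIGURE_1A_LINKS)
print(tc["B"], tc["C"])
print(clustering_coefficient(*tc["B"]), clustering_coefficient(*tc["C"]))
print(triangles_by_degree(FIGURE_1A_LINKS))
```

Nodes B and C have different triangle coefficients (1 and 5) but the same clustering coefficient, which is the reason the comparison here is made on the triangle coefficient rather than on the clustering coefficient.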

4.2.6 In-Triangle and Out-Triangle 

Figure 4f examines the three relationships of (i) triangle
coefficient versus degree, $\triangle(k)$, (ii) in-triangle coefficient
versus in-degree, $\triangle_{in}(k_{in})$, and (iii) out-triangle coefficient
versus out-degree, $\triangle_{out}(k_{out})$, for the citation network and
the average over all 18 web sites. That is, here we consider
the networks as directed graphs.

For the web sites, these three relationships closely overlap
one another. This means the probabilities of forming
triangles with a web page's in-neighbours and with its
out-neighbours are the same. However, for the citation network,
$\triangle_{in}(k_{in})$ is one order of magnitude larger than $\triangle_{out}(k_{out})$
for the same degrees. This means the probability of a paper
forming triangles with its citations (in-neighbours) is
significantly larger than that of it forming triangles with its references
(out-neighbours).
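
The in-triangle and out-triangle counts can be sketched in Python as below. Two assumptions are made: a pair of neighbours counts as linked if a hyperlink exists between them in either direction, and the edge list is a hypothetical set of links consistent with the text's description of node A in Figure 1b. The function name is illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical directed links consistent with the description of Figure 1b.
DIRECTED_LINKS = [
    ("B", "A"), ("C", "A"), ("E", "A"),   # in-neighbours of A: B, C, E
    ("A", "C"), ("A", "D"),               # out-neighbours of A: C, D
    ("B", "C"), ("E", "C"), ("C", "D"),   # links among A's neighbours
]

def in_out_triangle_coefficients(links):
    """Return {node: (in-triangles, out-triangles)} for directed links."""
    in_nbrs, out_nbrs, linked = defaultdict(set), defaultdict(set), set()
    for source, target in links:
        if source == target:
            continue                               # ignore self-loops
        out_nbrs[source].add(target)
        in_nbrs[target].add(source)
        linked.add(frozenset((source, target)))    # direction ignored here

    def count(nbrs):
        # Number of neighbour pairs that are linked to each other.
        return sum(1 for a, b in combinations(nbrs, 2)
                   if frozenset((a, b)) in linked)

    nodes = set(in_nbrs) | set(out_nbrs)
    return {n: (count(in_nbrs[n]), count(out_nbrs[n])) for n in nodes}

print(in_out_triangle_coefficients(DIRECTED_LINKS)["A"])
```

For node A this gives two in-triangles (ABC and ACE) and one out-triangle (ACD), matching the example in Section 2.3.1.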

This structural difference between web sites and the ci- 
tation network may reflect their different evolution dynam- 
ics. For a citation network, when a paper is published all 
its references existed before the publication of the paper 
and, of course, cannot be changed. However, a paper can 
always acquire new citations, and these citations may refer- 
ence other citations (thus continuing to form triangles). In 
contrast, for a web site, web pages and their associated hy- 
perlinks can be added, deleted or revised at any time. For 
web sites, there is no equivalent to a reference to a page that 
remains static and unable to be changed in the future. 

5 Conclusion 

We examined a number of topological properties of hyperlink
data crawled from 18 diverse web sites. Our empirical
results showed that the link structures of the web
sites are significantly different when measured with 1st and
2nd-order topological properties. This is probably to be expected
since the web sites are designed for different purposes
and developed independently. However, we observed
that web sites share a common 3rd-order topological property,
the relationship between triangle coefficient and degree.
This common relationship is unexpected and suggestive
of a topological invariant for web sites. Comparison
with the Web, the AS Internet, a citation network and the
Barabási-Albert model showed that this third-order property
is not shared across other types of networks. Thus, this
property appears to strongly characterise web sites. The
physical meaning of this 3rd-order property is that given the
number of hyperlinks to and from a particular web page, we
can statistically estimate how the web page's neighbouring
pages are interlinked; and this statistical estimation is valid
for all web sites.

Further evaluation on a wider variety of web sites is
needed to verify that this 3rd-order property is an invariant.
If so, then the fundamental question is why. Possible explanations
include standardised web site design principles,
popular web site development tools, or universal evolution
dynamics which fundamentally reflect the common nature
and function of web sites as a way of organising and disseminating
information. The answer to this question may
prove valuable for research on a number of issues, such as
modelling web sites and other document networks, making
recommendations for building web sites in the future, optimising
search engine algorithms, and understanding the fundamental
principles governing the evolution of the Web.